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Abstract 

o 

T— I This article presents a new classification framework that can extract individual features per class. The 

o 

^\i scheme is based on a model of incoherent subspaces, each one associated to one class, and a model on 

^~> how the elements in a class are represented in this subspace. After the theoretical analysis an alternate 

^) projection algorithm to find such a collection is developed. The classification performance and speed of 

^»«. the proposed method is tested on the AR and YaleB databases and compared to that of Fisher's LDA and 

'"^ a recent approach based on on ii minimisation. Finally connections of the presented scheme to already 

'^ ' existing work are discussed and possible ways of extensions are pointed out. 

u 

O Index terms: classification, feature selection, subspace learning, Grassmannian manifolds, alternate pro- 
jections. 



I. Introduction 



(^ A general approacli in classification is to select features of the signal at hand and to get a decision by 

o 

,__! comparing them to the equivalent features of already labelled signals with a simple classifier like nearest 

^ neighbour, e.g. |3], or nearest subspace, cp [^. This of course raises the question which features to take. 

For face recognition, which is the example we will use here, some classic and simple, because linear, 
features are Eigen, JT91, Fisher, 0, or Laplace features, fT\. However, as these classifiers are very simple 
and the features not adjusted to them, their performance is somehow disappointing, and researchers turned 
to the development of more complicated nonlinear features and kernel methods, lITOl . lITSll . 
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Here we start from the point of view that the potential of Unear methods and simple classifiers is not 
exhausted. In order to achieve better results, we propose to give up the uniformity of features over classes 
and mix the feature selection with the classifier. To motivate the idea of class specific features let us 
have a look at classical nearest neighbour (NN) and nearest subspace (NS) classification using linearly 
selected features and give it a new interpretation. 

Assume we have N already labelled training signals y G M'^ belonging to c classes, where each class 
i contains rii elements, i.e. ^^ rii = N. We denote the j-th signal in class i as yj, i = 1 . . .c, j = 
1 . . .nj. For each class i we collect all its training signals as columns in the d x rii class matrix Yi, 
i.e. Yi = {yj . . .yf'), and these class matrices in turn are combined into a big d x N data matrix 
Y = (Yi . . . Yc) = {yl ■ ■ ■ y"^ . . .yl . . . yc')- Given a new signal ymw the goal is to decide which class 
it belongs to with the help of the already labelled training signals. 

The classical first step is to select relevant features fnew from ynew via a Unear transform A, where A 
is a d X d matrix of rank r < d. 

Feature Selection: fnew = Aynew (1) 

The exact shape of the transform is determined by the training signals and their labels. For instance for 
Fisher's LDA A is chosen as the orthogonal projection that maximises the ratio of between-class scatter 
to that of within-class scatter, |]6|. 

In the second step these features are compared to the features // := Ayj of the training signals y^. In 
case of the nearest neighbour classifier this means that the new signal will get the label of the training 
signal which has features that maximally correlate with the features of the new signal, ie. 

NN Labelling: imw = argmax max | (// , fnew) \ ■ (2) 

i J 

If by analogy we define the feature matrix for a class i as Fi = AYi we can rewrite the expression, for 
which we are seeking the maximal argument, and combine the feature selection with the labelling step, 

max |(// , fnew)\ = max | {Ayj , Aynew) \ 

3 3 

= Til<%-iL\{A* Ayl,ynew)\ 
j 

= \\[A AYi) UnewWoo: w) 

where the matrix M* denotes the transpose of M, the p-norm of a vector is defined by \\v\\p := 
i^k'^(^)^)^^^ for 1 < p < oo and ||i^||oo := niax^ \v{k)\ and the qp-norm of a matrix by ||M||g^p = 
max||^ll^=i ||Afv||p. Thus another way of looking at the classification procedure is to say that for every 
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class we have a set of sensing signals A*Ayj and the new signal belongs to the class which has the 
sensing signal closest to it. From this point of view we also see that the scheme will work stably only if 
two conditions are fuUfilled. Every new signal is well represented by one vector in its class, i.e. a lot of 
its energy is captured by the projection on one vector, and no two sensing vectors from different classes 
are the same or close to each other, i.e. 

max\{A''Ayi,A*Ayi)\ = \\{A'' AYi)'' (A* AYk)\\i,oc < fi- (4) 

Let us do the same analysis for the nearest subspace classifier. Again the features of the new signal 
are compared to those of the training signals. For each class the features of the training signals in it 
span a subspace and the new signal will get the label of the class for which the orthogonal projection 
of the features of the new signal on the corresponding subspace has the highest energy. Let Qi be an 
orthonormal systenj^ spanning the subspace for class i, then i.e. 

NS Labelling: inew = argmax ||Q*/„e«,||2- (5) 

i 

Again we can combine the feature selection with the labelling by manipulating the expression, we want 
to maximise, 

IIQi/netulb = II (^ Qi) ynew\\2- 

If we compare to NN classification we see that again for every class we get a set of sensing signals, 
the columns of the matrix A*Qi, and that the new signal belongs to the class for which the sensing 
signals can take out the most energy (or for which the biorthogonal system to (AQi)* provides the best 
representation). Again this leads to two conditions for the classification to work, which are however more 
complex. First every new signal should be comparatively well represented in the biorthogonal system 
{Q*A)^ determined by its class and second no signal which is in the span of the sensing signals of one 
class should be well representable in the biorthogonal system of another class. 

max max WiA^Qi)^ A*Qjx\\2 = \\{A''Qi)^A''Qjx\\2,2 < M- (6) 

if^k ||a;||=l 

Summarising our findings for both nearest neighbour and nearest subspace classification we see that in 
both cases for every class we have a set of sensing signals or a subspace defined by the feature selection 
transform. There is a model how signals from this class are represented by this subspace, which implicitly 
determines which norm is used for the classification, || • ||oo for NN, || • ||2 for NS, and at the same requires 

'Qi can for instance be found via a (reduced) qr-decomposition of AYt 
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that the interaction of subspaces measured by a corresponding matrix norm is small, i.e. that they are 
incoherent. 

The classification scheme presented in this paper is based on the following idea. We give up the restriction 
that the subspaces associated to each class are generated canonically as a function of the feature selection 
transform and the training samples, i.e. A*AYi in the case of nearest neighbour classification, but generate 
them individually. This idea can also be motivated using the example of face recognition, to which we 
will apply our scheme later. Uniform feature extraction would mean realising that in general the most 
relevant parts of a face are the regions of the eyes, nose and mouth. Thus in order to classify a person we 
would focus on the eyes, nose and mouth regions while ignoring the hairstyle and comparing them to the 
eyes, nose and mouth regions of all the candidates. While this makes sense in general it will fail as soon 
as the set of candidates contains identical twins which can only be distinguished by the birth mark one 
has on his cheek. So while for most people the cheek is not a very distinguishing feature for the twins it 
is and it would be better to remember for them the cheek instead of for instance the nose. Even without 
the extreme example of the identical twins individual features are natural considering that the people we 
meet every day all have eyes, mouths and noses but not all of them have distinguishing eyes, mouths 
and noses. Instead they may have distinguishing birthmarks, scars, chins, etc. and a representation using 
these features will characterise them well but nobody else. 

In the next section we will introduce the mathematical framework on which we base our classification 
scheme. It consists of a model of subspaces associated to each class and a model of how the elements 
in this class are represented in this subspace, which together lead to a natural choice of the norm we 
have to use for the classification and an incoherence requirement on the subspaces. In Section [in] we 



will develop a comparatively simple algorithm to learn these subspaces from the training signal, which 



we will use to classify faces in Section IV In the last section we summarise our findings, point out 



connections to related approaches and outline possibilities for future work. 

II. Class Model 

The most general model for the subspaces we can think of is to ascribe to every class i a set of Si 
vectors //, j = 1 . . . Si, which are collected as columns in the matrix Fi = {fl . . . f^'-). These correspond 
to the features that characterise elements of this class, so every element yi in class i can be written as a 
combination of these class specific features with coefficients Xi and some residual ri, orthogonal to the 
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feature span, 

yi = FiXi + ri, r'^±sp{Fi). (7) 

The condition that features of a class well characterise the elements in it translates into a property of 
the coefficients Xj, i.e. when measuring their strength in some norm it is higher than the strength of the 
coefficients we would obtain trying to represent the element by features of the wrong class. Since without 
further restrictions on the set of features per class it is not straightforward to calculate the coefficients of 
the best representation of signal in a class, we will sacrifice generality for simplicity and for the rest of 
the analysis assume that for every class we have the same number of features s and that they form an 
orthonormal system, i.e. F*Fi = Is- We will point out how to deal with the more general situation in the 
last section. Given that the features of each class form an orthonormal system we can easily calculate 
the coefficients of the best representation of a general signal y in class i as F*y. The question is now 
how should we measure these coefficients in order to correctly classify our images, i.e. which norm || • || 
should we choose such that for all yi in class i we have 

< 1, Vj / i. (8) 



\F^y^ 



II F*ii- 

To answer this question we will introduce three models on the coefficients, each leading to a certain 
p-norm as optimal measure. 

A. Sparse Coefficients 

Assume that all signals we want to classify can be well represented by one element of one class, i.e. 

yi = FiXi + Ti with ||a;j||o = 1, (9) 

where || • ||o counts the number of non- zeros entries. An example for this situation would be trying to 
sort pictures of monkeys, snails, cucumbers and broccoli into animal and vegetable pictures. Even though 
monkeys and snails are both animals their shapes are very different, meaning that we can think of them 
as orthogonal, and the same goes for the shapes of cucumbers and broccoli in the other class. Let x be the 
absolute value of the only non-zero component of the coefficients Xj. We immediately see that whatever 
p-norm we apply using the correct class the response is always equal to x, \\F*yi\\p = ||xi||p = x. 
Therefore to find out which p-norm is best we will use a trick that involves estimating the ratio we need 
to be smaller than 1 for successful classification with the triangular equation and a matrix norm bound. 

So for general 1 < p, g < oo we get, 

II F*ii-\\ lU II II F*r-\\ 

IK i y^wp ii'*'*iip ii'''*iip 
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In the special case where the coefficients are sparse and thus \\x\\g = ||x||p,Vp, g, this means that 

II W^ll- II II W^T- II 

-*■ A t/7 7) \\ -^ A ' 1 \\Ti 

II jmwp \\p*p.\\ I II ] *IIP 



II T^-k II — 7 * y,/^ ' 

W r tl-W ■^ T 

IK j yillp -^ 

The smallest (/p norm of a matrix is obtained when p = oo and q = oo. Then it corresponds to the 
maximal absolute entry of the matrix F*Fi, i.e. the maximal absolute correlation between to features 
from different classes. Since in that case also the response from the residual ||-Fj*rj||p is minimal we get 
the best bound choosing the oo-norm for the classification. Summarising our findings we see that in case 
of a sparse model on the coefficients, the oo norm is optimal and that the incoherence requirement we 
get for classification to work stably is that no two features from two different classes are too similar, but 
it does not matter if a feature is moderately close to all features in a different class or even representable 
by them. Thinking to the example of the animal vs. vegetable pictures this means that even though you 
can approximate the shape of a snail combining the shape of the cucumber and the broccoli, classification 
using the co norm will work well because no animal shape alone closely resembles a vegetable shape 
and vice versa. 

B. Flat Coefficients 

Let us now assume the completely opposite distribution of the coefficients, i.e. to represent one element 
in a class we need to combine all features of that class with equal magnitudes, i.e. 

yi = FiXi + ri with \xi{k)\ = x, forA; = l...s. (11) 

An example would be trying to label pictures of national flags and with the corresponding countries. 
For simplicity assume that the only flags in question are those of the Netherlands, Germany, Estonia, 
Lithuania and Gabon, which all consist of three horizontal stripes in various colours, i.e. red, white and 
blue for the Netherlands, black, red and yellow for Germany, blue, black and white for Estonia, yellow, 
green and red for Lithuania and green, yellow and blue for Gabon, cp. Figure [T] Good features in this 
example are the colours of the stripes. Each national flag has its three distinctive colours which appear 
in an equal amount but are not exclusive to this flag. 

In this case we get the maximal response from the correct class when choosing p = 1, i.e. ||a;||i = sx. 



Also from Inequality (10 1 we see that using q = oo and p = 1 gives a very beneficial bound. 



WtViWi ~ s sx ' 
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Netherlands Germany Estonia Lithuania Gabon 

Fig. 1. National Flags 



Remembering that ||-F*-Fj||oo,i is smaller than the absolute sum of all the correlations between features 
in one class and features in another class we get a less sharp version of the above bound, 



ll-^*yi||i S SX 

which shows that for the case of flat coefficients we have a quite different coherence constraint. Even 
if a few features in a class are very close to features in another class or actually the same this is not a 
problem as long as the majority of features from two different classes are not very correlated. 
In the example of the flags this means that even though two different national flags might share up to 
two colours, as long as we take into account that all three colours have to appear to the same degree, 
we can still identify the country from a picture of the flag. 

C. Unstructured Coefficients 

The last case we are going to discuss is probably the most common and concerns coefficients which 
follow neither of the two extreme distributions discussed above or where the exact distribution is unknown. 
An example is the task of face recognition, i.e. identifying a person from a picture. Obvious features 
in this case are noses, eyes and mouths. In any picture most of these features will be visible but their 
strength will largely depend on the facial expression and lighting conditions. To choose a good p-norm 
for the classification in this case, we again bound the norm ratio we need to be small. 

IIF*7/-II \\F*F-T-\\ llF*r-ll 



ll-f'jyiiip ii'^«iip ii'^Jiip 

Since we do not have information about the shape of the coefficients, the first term on the right hand 
side can be as big as ||F*Fj||p^p = max||^||^=i ||F*Fj||p. Taking into account the orthogonality of the 
features in the matrices Fj, we see that for p = 2 this term can only be equal to one if two classes 
overlap, meaning that there is a signal whose features in its own class can be represented by features 
in a different class. For p = l/oo, however, the corresponding term is equal to the maximum absolute 
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column/row sum of the F*Fj and it can be easily seen that this can be larger than one, even if for 
no signal the features in its own class can be fully represented by features in a different class. Similar 
results hold for all other p / 2, thus making p = 2 the best choice in this case. Observe also that p = 2 
corresponds to measuring the energy captured by the features of a class. Thus if the features are well 
chosen also the second term in Inequality ([12]) can be expected to be small. 

Finally we see that choosing p = 2 puts the following incoherence constraint on the feature spaces. No 
signal that can be constructed from features in one class should be well representable by features in 
another class. This constraint is the strongest we have encountered so far, which is only natural since we 
do not have an assumption on coefficient distribution. Coming back to our example it also corresponds 
quite naturally to what one would expect from face recognition, ie. that in all pictures enough distinctive 
features are visible and no matter the lighting condition or facial expression two people can always be 
uniquely identified from their features. 

Of course there is ample opportunity to develop more class models, assuming different distributions on 
the coefficients and using more exotic norms. Also one could use different assumptions on the features, 
i.e. non-orthogonal. However, in this paper we will focus on finding a practical way to calculate sensing 
or feature matrices for classification based on the three main models. 

III. Finding Feature/Sensing Matrices 

From the analysis in the last section we can derive two types of conditions that the collection of 
features or subspaces Fi needs to satisfy. The first type describes how features from different classes 
should interact, i.e. the interplay measured in the appropriate matrix norm should be small, and the second 
type how the features should interact with the training data, i.e. the ratio of the response without to within 
class should be small. The problem with both kinds of conditions is they are not linear and difficult to 
handle. For instance calculating the (2, 2)-norm is equivalent to finding the largest singular value and 
calculating the (oo, l)-norm is even NP-hard. We will therefore start with a very simple approach that 
will lead to a reasonably fast algorithm, and in the last section point out how to extend it to include more 
complicated constraints. Instead of requiring explicitly that the interplay between features from different 
classes is small, hereby avoiding to investigate what small means quantitatively, we use the intuition that 
this should come as free side effect from regulating the interaction with the training data, and simply 
ask that F is a collection of orthonormal systems Fi each of rank s. What we would actually like to do 
about the interaction of the features with the training data is to minimise the ratio between the response 
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of the training data without to within class. However, a constraint involving the ratio is not linear and 
very hard to handle. We will therefore split it into two constraints that guarantee that the ratio is small 
if they are fulfilled. The first constraint is that the response within class is equal to a constant /3p which 
we choose to be the maximally achievable value given the rank of the orthonormal systems and p. The 
second constraint is that the response without class is smaller than a constant fip, whose dependence on 
s,p,d is more complicated and will be discussed later. Define the two sets Ts and T^ as 

J-, :={F=(Fi,...,F,): i^^F. = /.} 



■'fj. ■— i^ ■ ll-f^i til lip — Pi 



\\F*yf\\p<fip,yk,i,j^i}, (13) 

then our problems could be summarised as finding a matrix in the intersection of the two sets, i.e. 
F G Fs r\ F^. However, since this intersection might be empty, we should rather look for a pair of 
matrices, each belonging to one set, with minimal distance to each other measured in some matrix norm, 
eg. the Frobenius norm, denoted by || • ||J^ 

min \\Fs - F42 s.t. Fs € Fs, F^ G F^. (14) 

One line of attack is to use an alternate projection method, i.e. we fix a maximal number of iterations, 
an initialisation for F^ and then in each iterative step do: 

• find a matrix F^ G argminp^gjr H-Fg'^"^ — -^11 2 

• check if H-F^'^"^ ~ ^» II2 is smaller than the distance of any previous pair and if yes store Fg~^ 

• find a matrix F^ G argmin^^^jr^ ll-^u ~ -^l|2 

• check if H-^i^ — i^^||2 is smaller than the distance of any previous pair and if yes store F^ 

If both sets are convex, the outlined algorithm is known as Projection onto Convex Sets (POCS) and 
guaranteed to converge. Non convexity of possibly both sets, as is the case here, results in much more 
complex behaviour. Instead of converging, the algorithm just creates a sequence {F^,Fg) with at least 
one accumulation point. We will not discuss all the possible difficulties here but refer to |[T8l . where all 
details, proofs and background information can be found and wherein the authors conclude that alternate 
projection is a valid strategy for solving the posed problem. 

To keep the flow of the paper, we will not discuss the two minimisation problems that need to be 
alternatively solved here. The interested reader can find them, including the exact parameter settings in 

^We use this notation instead of the more common variant 11 ■ 11 f to avoid confusion. 
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the simulations of the next section, in the appendix. Instead we will discuss how to set the parameters 
Pp,fj-p and possible choices for the initialisation F^. 

As mentioned above we choose (3p to be the maximally achievable value. An orthonormal system of s 
feature vectors can maximally take out all the energy of a signal, 

\\F*yih < lly^lb- (15) 

As the signals are assumed to have unit norm, this energy is at most one and we set /32 = 1. The maximal 
1-norm of the vector F*yi of length s with energy 1 is ^/s. This is attained when all features of one 
class take out the same energy, i.e. the absolute values of the entries in F*yi are all equal to l/Vs. This 
leads to /3i = ^/s. The infinity norm F*yi corresponds to the maximal inner product between one of the 
feature vectors and the signal. As both the feature vector and the signals are normalised, this can be at 
most one and so we set /3oo = 1- 

From the discussion in the last section we see that the parameter fi reflects the incoherence we require 
between features from different classes. If we have d> c-s, it is theoretically possible to have c subspaces 
of dimension s which are mutually orthogonal to each other, and /i could be zero. As soon as the above 
inequality is reversed, because for instance the actual dimension of the span of all features, i.e. rank{F), 
is smaller than d, not all subspaces corresponding to the different classes can be orthogonal but will have 
to overlap. How the size of this overlap, i.e. coherence, should be measured, is determined by the choice 
of p-norm for classification. For instance f or p = 2 the coherence was measured by ||-F*-Fj||2,2 and from 
theory about Grassmannian manifolds, see lITSll . we know that the maximal coherence between two of c 
subspaces of dimension s embedded in the space M.'^ can be lower bounded by 



■ 2 s ■ c — d 



max\\F*Fi\\l2>-r, tt- (16) 



The problem with setting fi as above is that we are not controlling the interaction between the sets of 
features directly but only indirectly over the training data. There the worst case might not be assumed and 
so fi as above would be too large. Also for the cases p = 1, oo we do not have a similar bound. Therefore 
instead of trying to analyse theoretically how to set jj., where we have to deal with too many unknowns, 
we use the above bound as an indication of order of magnitude and, when testing our scheme on real 
data, vary the parameter fi. Lastly for the initialisation for each class we choose the orthogonal system 
that maximises the energy taken from this class opposed to the energy taken from the other classes, i.e. 

F", = argmin ||F*y,||i - J] ||i^^y,||i. (17) 
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This problem can be easily solved, by considering the rewritten version of the function to minimise, 

^min^^ trace [f:{Y,Y: - J] Y,Y[)F,) . (18) 

If UDU* is an eigenvalue decomposition of the symmetric (Hermitian) matrix YiY* — J2j=^i'^j'^-j'' 
then the minimum is attained for F^^ consisting of the s eigenvectors corresponding to the s largest 
eigenvalues. 

IV. Testing 

To test the proposed scheme we use two face databases, the AR-database, |[T3l and the extended Yale 
B database, m. First we will test the validity of all three approaches on the AR-database, even though it 
is intuitively clear that the most appropriate model for faces corresponds to p = 2. Using the experience 
from the AR-database we will then run similar tests on the extended Yale B database using only the 
most appropriate model p = 2. 

A. AR-Database 

For the test we used a subset of images from the AR-database. For each of the 126 people there 
are 26 frontal images of size 165 x 120 taken in two separate sessions. The images include changes 
in illumination, facial expression and disguises. For the experiment we selected 50 male and 50 female 
subjects and for each of them took the 14 images with just variations in illumination and facial expression, 
neutral, light from the right and left, front light, angry, happy, sleepy. The all together 700 images from 
the first session were used as training data and the 699 imageqj from the second session for testing. 
Every image was converted to grayscale and then stored as a 19800 dimensional column vector. The 
images from the first session were stored in the 19800 x 700 matrix Y^ and those from the second in 
the 19800 X 699 matrix Y^. All images (columns) in y^ were re-scaled to have unit norm. In order to 
speed up the calculations, we first applied a unitary transform, which does not change the geometry of 
the problem, but reduces the size of the matrices, i.e. we did a reduced Qii-factorisation decomposing 
Y^ into the 19800 x 700 matrix Q with orthogonal columns and the 700 x 700 upper triangular matrix 
R and set Y^ = Q*Y^ = R and Y^ = Q*Y'^. 

We tested the proposed scheme for all three choices of p and varying values of ^p scaling from to 10% 
of j5p and number of features per class varying from 1 to 7. The choice of the maximal outside-class 

^700 minus corrupted image w-027-14.bmp 
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TABLE I 

Number of misclassified images on the AR-database FORp = 1 and varying values s and fi. 
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TABLE II 

Number of misclassified images on the AR-database for p = 2 and varying values s and ^. 



contribution fj-max = 0.1/3p was inspired by the bound in ( |T6| ). If we take as effective signal dimension 
d = 700 and assume that the space should not only accommodate the 100 different people in our training 
set but all people, i.e. we let c go to infinity, the bound approaches \/s/d which is 0.1 if s = 7 and 
0.0378 if s = 1. The maximal number of features per class is 7, since we only have 7 test images and 
so it does not make sense to look for spaces of higher dimension containing all test images. Note also 
that for s = 1 the three schemes are the same, so the results are only displayed once. For each set 
of parameters we calculated the corresponding feature matrix using the algorithm described in the last 
section on the images from the first session. We then classified the images from the second session using 



the appropriate p-norm. The results are shown in Tables Hi O and III 



As we can see we get the best performance for p = 2, followed by p = 1 and p = oo. This comes 
as no surprise when considering the structure of our data. Intuitively the important features of a face are 
eyes, nose and mouth. Since the people in the pictures have different facial expression, usually not all 
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TABLE m 

Number of misclassified images on the AR-database FORp = oo and varying values s and p. 



of these features will be active explaining why p = 1 is not the most appropriate model. On the other 
hand we can expect to have more than one feature active at the same time even if not to the same extent. 
Using p = oo we lose the information given by these secondary active features while with p = 2 we still 
incorporate it into the final decision. 

We can also see that 0.1% of fj, as maximally allowed outside class 'energy' seemed to have been a 
good choice as we can always see a small decrease and large increase of the error going from to 0.1, 
with the best range for p = 1 and p = 2 between 0.01 and 0.03 and for p = oo between 0.02 and 
0.06. For p = 1 we get better performance for the lower dimensions, which seems reasonable because 
there the equal energy distribution over the features is easier achieved. For p = 2 on the other hand the 
better performance is achieved with higher dimensions, which are able to capture more important side 
details. Finally f or p = 2 the results seem equal for all dimensions. A possible explanation is given by 
the initialisation, which ensures that for all dimensions the first, most promising direction is included. 
Still in all three cases in the most promising ranges the proposed scheme outperforms a standard method 
like Fisher's LDA, |6]. The best result by LDA is obtained when using the original (not-normalised) 
images and the highest possible number of discriminant axes c — 1 = 99. In this case nearest neighbour 
classification, corresponding to p = oo but with non orthogonal features, fails to identify 59 images, and 
nearest subspace classification, corresponding to p = 2 fails to identify 71 images. When concentrating 
on the results for p = 2, which is the most sensible choice given the structure of the data, p = 2, we also 
see that the scheme performs well in comparison to a recent, successful method based on £i minimisation, 
||20l . The best result reported there is a success rate of 94.99%, meaning 35 misclassified images, which 
is 5 images better than our best case of 40 errors. 
Encouraged by the promising results we now turn to testing our scheme on the extended Yale B database. 
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26.25 ± 6.61 
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14.15 ±4.37 


13.60 ± 4.22 


13.85 ± 3.73 


15.85 ± 5.25 


16.40 ± 4.78 


17.55 ± 6.00 
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15.75 ± 3.82 


14.05 ± 3.49 


13.95 ± 3.55 


15.35 ± 3.95 


16.45 ± 4.10 


16.95 ± 4.30 


5 


15.70 ± 4.78 


15.00 ± 4.91 


14.45 ± 4.30 


15.30 ± 3.34 


17.60 ± 4.65 


17.65 ± 4.55 



TABLE IV 

Mean ± standard deviation of misclassified images on the Extended Yale B database for p = 2 and 

VARYING VALUES S AND ^. 



B. Extended Yale B Database 

From the extended Yale B database we used the 2414 frontal face images, about 64 images taken under 
varying illumination conditions for each of the 38 people. For the test we randomly split the set of images 
per person into an equal number of training and test images, using one more training than test image in 
case of an odd number of images per class. We then ran our classification scheme with the number of 
features per class varying from 2 to 5 and thanks to the experience gained from the AR-database with 
the values of fi running only from to 0.05. For the computation of the feature matrices we used the 
same simplifications as described for the AR-database. For comparison we ran Fisher's LDA with 37 and 
30 discriminative axes in combination with the nearest neighbour classifier. This procedure was repeated 
19 times and the mean of all 20 runs was computed. 



The results of our method can be found in Table IV While Fisher's LDA on average missclassified 
23.30 lb 6.42 images (success rate of 98.07 it 0.53%) using 37 discriminant axes and 231.55 ± 23.48 
images (success rate 80.78 ± 1.95%) using 30 discriminant axes, our method in the best case only 
misclassified 13.60 ± 4.22 images (success rate 98.87 it 0.35%). In general it outperformed Fisher's 
LDA for a wide range of values for /i and s. 

Comparison to the ^i -minimisation scheme in |[20l is harder, as it seems that there only a single run was 
used. However, their best success rate of 98.26%, achieved at the same time as Fisher's LDA with 30 
discriminant axes achieved 87.57% (the maximal rate for Fisher's LDA we encountered in 20 runs was 
84.73%), is still below our best average rate of 98.87%. 



To illustrate the results in Figure [2] and to confirm the motivation in the introduction for using different 
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features for different classes, we show what happens to the training images of two different subjects when 
projected on the features of their own class and the other subject's class. As expected the projections on 
features of their own class nicely filter out common traits like eyes, mouths and noses, but on top of that 
the features of the first subject capture the very distinctive birth mark on his right cheek. The projections 
on the wrong class on the other hand are not only much weaker (note the difference in scale) but also 
less clear. Two overlapping sets of features seems to appear at the same time, the ones that belong to 
the subject in the image and the ones that the projection is trying to filter out. 
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Fig. 2. Images of two subjects, original (a) & (d), projected onto the span of features from their own class (b) & (e), projected 
onto the span of features of the wrong class (c) & (f) 



Summarising the results, we can say that our method outperforms a classic scheme like Fisher's LDA. 
In comparison to the £i -minimisation scheme in ||20l it performs slightly worse on the AR-database but 
seem to be better on the YaleB-database. However it has one big advantage over the ^i -minimisation 
scheme, which is its low computational complexity. Not taking the calculation of the feature matrices 
into account, as this is part of the pre-processing, basically all that has to be done to classify a new data 
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vector is to multiply it with the feature matrix and calculate some statistics on the resulting vector. The 
£i minimisation method on the other hand requires on top of extracting the features the solution of a 
convex optimisation problem 

min llzlli s.t. \\fnew - Fz\\2 < e, (19) 

where F in this case is the dj x N matrix containing the features of all the training data. For comparison 
in II20II the authors state that the classification of one image takes a few seconds on a typical 3 GHz 
Pc. At the same time for classifying 1205 images of size 192 x 168, using our method with 4 feature 
dimensions per class, MATLAB takes less than half a minute on a Dual l.SGhz PowerPC G5, which is 
less than 25ms per image. 

In the next section we will introduce the mathematical framework on which we base our classification 
scheme. It consists of a model of subspaces, one associated to each class, and a model on how the 
elements in this class are represented in this subspace, which together lead to a natural choice of the 
norm we have to use for the classification and an incoherence requirement on the subspaces. 

V. Discussion 

We have presented a classification scheme based on a model of incoherent subspaces, each one 
associated to one class, and a model on how the elements in a class are represented in this subspace. From 
a more practical viewpoint we have developed an algorithm to calculate these subspaces, i.e. the feature 
matrices, and shown that the scheme gives promising results on the AR database and even outperforms a 
state of the art method like the £1 -minimisation scheme in ll20l on the YaleB -database. The idea that each 
class should have its own representative system, learned from the training data can already be found in 
lITTl . There frames or dictionaries for texture classification are learned, such that each provides a sparse 
representation for its texture class. The new texture then gets the label of the texture frame providing the 
sparsest representation. In (121, the same basic idea is used but the learning is guided by the principle that 
the dictionaries should also be discriminant, while in fl^ both learning principles are combined, i.e. the 
dictionaries should be discriminant and approximative. This third scheme can be considered as a more 
general and more complicated version of our approach. Alternatively our approach can be considered to 
be a hybrid of Nearest Subspace respectively Nearest Neighbour and the discriminative and approximative 
frame scheme, in so far as it is linear but has individual features for every class. 

The idea to use a collection of subspaces for data analysis can also be found in ifTTl . where the 
subspaces are used to model homogenous subsets of high-dimensional data which together can capture 
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the heterogenous structures. 

For the future there remain some interesting directions to explore. Firstly the possibilities of the subspace 
classification approach do not seem exhausted using the proposed algorithm. Ironically this fact revealed 
itself through a bug in the minimisation procedure, resulting in matrix pairs with distances larger than 
the optimal ones, and sensing matrices giving better classification results, i.e. in the best case an error of 
only 35 misclassified images. The main difference of these fake optimal matrices to the sensing matrices 
corresponding to the actual minima, seemed to be that, while capturing approximately the same 'energy' 
within class, they were more accurate in respecting the without class energy bound, i.e. less overshooting 
of the maximally allowed value /i. This overshooting for the real minima is a result of imposing not only 
ll-^iyJ^lb < A* but also ||-Fjyj^||2 = /3, which forces the optimal feature matrix to balance the error incurred 
by not attaining /3 within class and the error incurred by being larger than fi without class. A promising 
idea to avoid the overshooting would be to change the problem formulation and ask to maximise the 
'energy' within class subject to keeping the 'energy' without class small, i.e. in the case p = 2 solve, 

max^\\F*Y,\\l 
i 

s.t. F^Fi = Is and 

Lastly our approach allows to impose additional constraints on F, like incoherence of the subspaces 
between each other, e.g. ||-Ff-Fj||2,2 < i^ for p = 2, or low rank of the whole feature matrix to reduce the 
cost of calculating F*ynew Another possibility to reduce computational cost if d and N are very large, 
especially in the training step, would be to first take random samples of the training data, which reduce 
their dimension but very likely preserve the geometrical structure, as described in 121 and used in EOl . 
Alternatively to reduce the dimension of F one can apply our scheme on classical features, like Eigen 
or Laplace features, instead of directly on the raw training data. 

Appendix 

In order to use the alternate projection method for calculating the feature matrices we need to find 
the projection of a matrix F onto Fg and onto J^^ in the three cases p = 1,2, oo. We will start with the 
easier of the two problems 

find: Fg G argmin ||F — -F||2- (20) 
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Since the minimisation problem is invariant under squaring of the objective function and thus equivalent 
to 

c 

min IIF-Fllo = min V||Fi-Fj||o, (21) 

i=\ 

it splits into c independent problems 

min ||Fi-Fi||o. (22) 

The solution of these problems is straightforward. If Fi has the reduced singular value decomposition 
Fi = UiSiVi then the orthonormal system Fi of same rank closest to it is Fi = UiVi, see e.g. [8j|. 
The second minimisation problem 

find: F^ € argmin ||F — -F||2- 

is more complicated to solve. Assume that the number of training signals is larger than the dimension 
of the signals and span the whole space, so that the d x N matrix Y has rank d < N. li not we 
embed the training signals into a lower dimensional space corresponding to the rank of Y via a reduced 
Qi^-decomposition of Y and set Y = Q*Y = R before starting the alternating projection procedure. 
Afterwards we set F = Q*F, where F is the feature matrix calculated from the lower dimensional 
embedded data. Since Y has rank d we have YY'^ = Id and can reformulate the problem to solve as 

min IIF - Fh = min \\(F*Y - F*Y)Y''\\2. (23) 

The advantage of this formulation is that it is in terms of F*Y, which is also used to describe F^. To 
further exploit this property we define the set t/^, which is of the form F*Y with F G 7"^. To characterise 
the set Q^ we assume the following notation. Let dj refer to the s x rij submatrix that corresponds to 
F*Yj inside F*Y and denote the k-th column of Gy by Gij{:, k). We can then define 

g^:={G:\\Gii{:,k)\\p = Pp. 

||G,,(:,A;)||p</ip,VA:,i,j/i}. (24) 



Set G = F*Y then the problem in (23 1 is equivalent to 



min||(G-G)yt||2. (25) 

To attack this problem we will use resolvents or proximity operators which are a generalisation of 
projection operators. Given a Hilbertspace Ti and a function / from ^ to ] — oo, +oo] that is lower 
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semicontinuous, convex and not identical to +00, i.e. belonging to Tq{'H) the proximity operator proxj 
is defined by 

proxj(x) = argmin/(y) + -\\x - y|||^. 
H ^ 

Proximity operators were first studied by Moreau in [14], who developed a theory of proximal calculus, 
and recently have been used to solve optimisation problems in signal processing, [4|. Here we will use 
the forward backward splitting approach as described in [5]. Assume that we can write the function to 
minimise as the sum of two functions /i, /2 in ro(^), i.e. 

min/i(x) + /2(x). (26) 

If /2 is differentiable with a /3-Lipschitz continuous gradient for /3 > then the sequence generated by 
fixing xq £ T-L and iterating 



X 



n+l 



prox^„^^(x"-7"V/2(x")) (27) 



converges weakly to a minimum of (26 1 if 7 < 2//3. 

To apply the concept to our problem we take as Hilbert space the set of all c • s x A'^ matrices G equipped 

with the Frobenius norm and define the indicator function Ig of the set Q^ by 

1 if GGGf, 



+00 else 



The we can replace problem ([25]) by 



minIg^(G) + ||(G-G)yt||2. (28) 

The slight imperfection of this approach is that the set Q^ is not convex, therefore Ig (G) is not convex 
and the sequence generated applying ( [27] ) is not guaranteed to converge. Finding only a local minimum 
is however not such a big problem, since the procedure is only part of a bigger iterative scheme, as long 
as in each step we get some improvement. 

What remains to be done is to calculate the proximity operators for 7/1 = 7!^^^ = Ig^, the gradient of 
f2{G) = \\{G — G)yt||2 and decide about the initialisation Go and the step sizes 7". A straightforward 
calculation shows that V/2(G) = 2{G — G)Y'^ {Y'^Y . Since Ig is an indicator function the proximity 
operator is simply the orthogonal projection onto Q^, i.e. 

argminig {G) + \\\G'' - G\\l = argmin ||G" - G||l 

G " 2 G<^g 
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Because of the structure of ^^, see (24i, the problem above spUts into the smaller problems 



min \\Gl{:,k)-Gii{:M\l'^i 

||Gi.(:,A:)||p=/3p 

and „^ mm ||Gr,-(:, fc) - G,,(:, A;)||2, Vi / j. 

In other words for p = 1, 2, oo we need to solve problems of the form 

min ||(7 — /1II2 and min ||5 — /ijli- (29) 

l!9ilp=^P ll9l|p<Mp 

The solutions are collected in the following Theorem. 

Theorem 1: Denote by gp the minimal argument of the first problem and by g^^ the minimal argument 
of the second problem in ([29]). 
p = 1 : Set a{i) = sign(/i(i)) if h{i) / and cr(i) = 1 else, and denote by m the length of the h, then 



gj3^{i) = h{i) + (j{i)\, where A 



m 



If ll^lli < /^i set g^^ = h. Otherwise set g^ = h and iteratively shrink 

5^^(^) = a(z)max(|/-l(^)|-A^0), 

II fc— 111 
where A'^ = "\ ^~^ . 

until g^ with H^'^Hi = [i is found and set gf^^ = g^ . 
p = 2: 

h 



9(32 = P2 



2 

h if ||/l||2 < /U2 



5^2 , , 

p = c« : Let imax be the index of (one of) the largest absolute component of h then 

1 if i = ir 
h{i) else 

, MO if|/i(i)|</ic 

/Xoo else 

Lastly as initialisation G^ we choose the projection of G onto ^^, i.e. G*^ = prox^ (G). Finding the 
correct step-sizes is usually a matter or trial and error. For the application considered here we used 
7" = ||G"||2/(20||V/2(G")||2), which worked better for small /^, and 7" = 1/||V/2(G")||2, which 
worked better for large ^. The iteration was stopped when the relative improvement in each step was 
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below 10"^. The number of iterations for the alternative projections was set to 10. 

Thanks: We would like to thank John Wright for helping us getting the cropped version of the AR- 
database faces. 
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